A study on the timeseries forecasting method: ARIMA

Data Preparation

The daily data (for confirmed cases in each country) has already been preprocessed in update_data.py and is stored in the data directory

Now we write a function that takes this cumulative time series and returns a new time series with the number of new daily cases for a single country, in order to:

  1. Make trend and seasonality easier to identify, and
  2. Reduce the impact of the cumulative number occasionally going down (data corrections)
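A minimal sketch of such a function, using pandas (the function name and the toy numbers are illustrative, not the actual code from the repository):

```python
import pandas as pd

def daily_from_cumulative(cumulative: pd.Series) -> pd.Series:
    """Convert a cumulative case count into daily new cases.

    The first difference gives the day-over-day change; negative values
    (downward corrections in the source data) are clipped to zero.
    """
    daily = cumulative.diff()
    daily.iloc[0] = cumulative.iloc[0]  # on day one, all cases are "new"
    return daily.clip(lower=0)

# toy example with made-up numbers
idx = pd.date_range("2020-03-01", periods=5, freq="D")
cum = pd.Series([10, 15, 15, 14, 20], index=idx)
print(daily_from_cumulative(cum).tolist())  # [10.0, 5.0, 0.0, 0.0, 6.0]
```

Note the day with a downward correction (15 → 14) is clipped to zero rather than reported as a negative count.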

For the sake of this study we will be using the time series for India to build our model, but in the process we will also write generic functions and a workflow that can be used for any of the countries in the main dataset, with near-optimal parameters.

We will be using the holidays of that particular country as an additional feature (besides the past data), to help the model identify patterns.
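One way to encode holidays is as a 0/1 exogenous series aligned with the data's date index. A sketch with pandas, using a hand-picked subset of Indian public holidays in 2020 purely for illustration (in practice the dates could come from a package such as `holidays`):

```python
import pandas as pd

# hand-picked illustrative dates, NOT a complete holiday calendar
INDIA_HOLIDAYS_2020 = ["2020-03-10", "2020-04-10", "2020-05-25", "2020-10-02"]

def holiday_indicator(index: pd.DatetimeIndex, holiday_dates) -> pd.Series:
    """Return a 0/1 series marking holidays, usable as an exogenous feature."""
    holiday_set = set(pd.to_datetime(holiday_dates))
    return pd.Series([int(d in holiday_set) for d in index],
                     index=index, name="is_holiday")

idx = pd.date_range("2020-03-08", periods=5, freq="D")
print(holiday_indicator(idx, INDIA_HOLIDAYS_2020).tolist())  # [0, 0, 1, 0, 0]
```

The resulting series can be passed as the `exog` argument of the model later on.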

Model Identification

Depending on the type of our time series, there are 2 to 7 parameters that we will need to identify:

  1. Stationary time series:

A stationary time series is one whose statistical properties (mean, variance, and autocovariance) do not depend on the time at which the series is observed, but only on the lag 'k' with respect to some other point in the series, i.e. one that has a constant mean and variance (and thus, by definition, cannot have a non-zero trend). An example of a perfectly stationary time series is white noise. But as real time series are rarely perfectly stationary, we will set a threshold (p-value of 0.05 in adfuller) and use a test (Augmented Dickey-Fuller, or adfuller) to identify whether the time series is stationary or not.

The forecast of a stationary time series can be produced with an ARMA model. It consists of two parts: Autoregressive (AR) and Moving Average (MA). Despite the name, the MA part is not a simple average of past values: it models the series as a (possibly weighted) combination of the past q forecast errors. The AR part is a regression of the time series on its own values observed at earlier points, up to a specified lag k. For example, if we were to use an AR model with lag 1, i.e. AR(1), the model's equation would be given as follows:

Y(t+1) = μ + βY(t) + ε(t+1)

where μ is a constant (intercept) term, β is the coefficient of the previous value of the time series, and ε(t+1) is the residual (error) term
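The AR(1) equation can be simulated directly to build intuition. A small sketch (the coefficient β = 0.7 is arbitrary); for an AR(1) process, the lag-1 sample autocorrelation should come out close to β:

```python
import numpy as np

rng = np.random.default_rng(42)
mu, beta, n = 0.0, 0.7, 2000

# Y(t+1) = mu + beta * Y(t) + eps(t+1)
y = np.zeros(n)
for t in range(n - 1):
    y[t + 1] = mu + beta * y[t] + rng.normal(0, 1)

# lag-1 sample autocorrelation, expected to be close to beta
r1 = np.corrcoef(y[:-1], y[1:])[0, 1]
print(round(r1, 2))
```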

In this model, we need to identify only 2 parameters/orders: p and q, the lags for the AR and MA processes respectively. The final model will be an ARMA(p,q), or equivalently ARIMA(p,0,q).

  2. Non-stationary and non-seasonal time series:

If our time series does not satisfy the conditions for a stationary time series, we will use the method of differencing to convert it to a stationary time series. This is what the extra 'I' in ARIMA stands for: Integrated, i.e. the order of differencing.

In this model, the 3 parameters/orders we will need to identify are: p, q (same as that from ARMA) and d, the order of differencing. The final model will be ARIMA(p,d,q).

  3. Seasonal time series:

If our time series has an additional seasonality component, we will need a SARIMA/SARIMAX (Seasonal ARIMA) model for the forecast, as seasonality doesn't work well with the standard ARIMA model. For this, in addition to the 3 parameters we found for ARIMA, we will need to identify:

  1. P and Q: the lags of the seasonal component of the time series
  2. D: the order of differencing for the seasonal component
  3. s or m: the number of seasonal periods in the time series. (i.e, if m is 4, each period will be 1/4 of a year, so quarterly. If 12, then monthly etc.)

Step 1:

Checking if our time series is stationary or not, and if not, identifying the order of differencing.

As our p-value is > 0.05, the time series is not stationary.

Another indication of this is that the decline in the autocorrelations is very slow (shown below)

We will now run the same test on the same series after a single difference (d = 1)

After this single difference, we find that our time series satisfies both conditions for being stationary.

We have successfully identified the order of differencing to be 1. (i.e, d = 1)

Step 2:

To identify the lags p and q manually, we can follow these steps:

  1. The partial autocorrelation (PACF) is significant only for the first p lags, then cuts off to zero.
  2. The autocorrelation (ACF) is significant only for the first q lags, then cuts off to zero.

From these plots we can see that the lags up to 2 are significant for both the ACF and the PACF, therefore our manually arrived-at values of p and q are 2 and 2 respectively.

So the model we arrived at from our analysis is ARIMA(2,1,2). (NOTE: we have not yet checked or accounted for the seasonal component, as we do not need to at this stage)

Model Evaluation

Additionally, both to compare with our analysis and to optimize our model, we can use auto_arima from pmdarima, which works like a grid search for the most optimal parameters

This agrees with our analysis!!

NOTES:

  1. Setting a higher max_p and max_q might result in auto_arima selecting a different model, but by experimenting with those values, I observed that the model was overfitting, and hence stuck to their max values being 2 and 2 respectively.

  2. We do not take seasonality into account here, as ARIMA/SARIMAX is known to face many problems with high-frequency data (even weekly data (m=52) causes problems, and ours is daily data (m=365)) and with data where the seasonal cycles are too long, as mentioned in these links:

a. Poor performance on long cycles,

b. SARIMAX too blunt for high-frequency data (answer written by one of the authors of statsmodels himself!)

Forecasting